46 research outputs found
Software effort estimation based on optimized model tree
Background: It is widely recognized that software effort estimation is a
regression problem. Model Tree (MT) is one of the Machine Learning based
regression techniques that is useful for software effort estimation, but like
other machine learning algorithms, MT has a large configuration space
and requires its parameters to be set carefully. The choice of such parameters
is dataset dependent, so no general guideline can govern this process, which
forms the motivation of this work. Aims: This study investigates the effect of
using a recent optimization algorithm, the Bees algorithm, to specify
the optimal choice of MT parameters that fit a dataset and therefore improve
prediction accuracy. Method: We used MT with optimal parameters identified by
the Bees algorithm to construct a software effort estimation model. The model has
been validated over eight datasets that come from two main sources: PROMISE and
ISBSG. We also used 3-Fold cross validation to empirically assess the
prediction accuracies of different estimation models. As a benchmark, results are
also compared to those obtained with Stepwise Regression, Case-Based Reasoning
and Multi-Layer Perceptron. Results: The results obtained from the combination of
MT and the Bees algorithm are encouraging and outperform other well-known
estimation methods applied to the employed datasets. They are also interesting
enough to suggest the effectiveness of MT among the techniques that are
suitable for effort estimation. Conclusions: The use of the Bees algorithm
enabled us to automatically find optimal MT parameters required to construct
effort estimation models that fit each individual dataset. It also provided a
significant improvement in prediction accuracy.
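The 3-Fold cross validation mentioned in the abstract can be illustrated with a minimal sketch. It assumes contiguous, unshuffled folds and uses a mean-effort baseline as a stand-in predictor; the function names and the baseline are hypothetical and are not the authors' Model Tree or Bees algorithm setup.

```python
from statistics import mean


def k_fold_indices(n, k=3):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        # The first (n % k) folds receive one extra element.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_mae(efforts, k=3):
    """Mean absolute error of a mean-effort baseline under k-fold validation.

    The baseline (predict the mean of the training efforts) is only a
    placeholder for whatever estimation model is being assessed.
    """
    errors = []
    for test_idx in k_fold_indices(len(efforts), k):
        held_out = set(test_idx)
        train = [e for i, e in enumerate(efforts) if i not in held_out]
        prediction = mean(train)
        errors.extend(abs(efforts[i] - prediction) for i in test_idx)
    return mean(errors)
```

Each project is held out exactly once, so every error is measured on data the model did not see during training.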
Analogy-based effort estimation: a new method to discover set of analogies from dataset characteristics
Analogy-based effort estimation (ABE) is one of the efficient methods for
software effort estimation because of its outstanding performance and
capability of handling noisy datasets. Conventional ABE models usually use the
same number of analogies for all projects in the datasets in order to make good
estimates. The authors claim that using the same number of analogies may
produce the best overall performance for the whole dataset but not necessarily the best
performance for each individual project. Therefore there is a need to better
understand the dataset characteristics in order to discover the optimum set of
analogies for each project rather than using a static k nearest projects.
Method: We propose a new technique based on the Bisecting k-medoids clustering
algorithm to find the best set of analogies for each individual project
before making the prediction. Results & Conclusions: With Bisecting k-medoids
it is possible to better understand the dataset characteristics and
automatically find the best set of analogies for each test project. Performance
figures of the proposed estimation method are promising and better than those
of other regular ABE models.
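For context, the conventional fixed-k approach that the abstract contrasts itself with can be sketched as follows. This is only the static "k nearest projects" baseline, not the proposed Bisecting k-medoids technique; it assumes numeric, already-normalized features, Euclidean distance, and the mean of the analogies' efforts as the estimate, all of which are illustrative choices.

```python
import math


def estimate_by_analogy(history, target_features, k=3):
    """Estimate effort as the mean effort of the k nearest historical projects.

    history: list of (features, effort) pairs, where features is a tuple of
    normalized numeric values (an assumption of this sketch).
    """
    if not 1 <= k <= len(history):
        raise ValueError("k must be between 1 and the number of projects")
    # Rank historical projects by Euclidean distance to the target project.
    ranked = sorted(history, key=lambda p: math.dist(p[0], target_features))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / k


# Hypothetical data: (normalized size, normalized team experience) -> effort
projects = [
    ((0.0, 0.0), 100),
    ((1.0, 1.0), 200),
    ((0.1, 0.1), 120),
    ((0.9, 0.9), 180),
]
estimate = estimate_by_analogy(projects, (0.0, 0.0), k=2)
```

Here the same k is applied to every target project, which is exactly the limitation the abstract raises.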
Analyzing the Relationship between Project Productivity and Environment Factors in the Use Case Points Method
Project productivity is a key factor for producing effort estimates from Use
Case Points (UCP), especially when the historical dataset is absent. The first
versions of UCP effort estimation models used a fixed number or very limited
numbers of productivity ratios for all new projects. These approaches have not
been well examined over a large number of projects so the validity of these
studies was a matter for criticism. The newly available large software datasets
allow us to perform further research on the usefulness of productivity for
effort estimation of software development. Specifically, we studied the
relationship between project productivity and UCP environmental factors, as
they have a significant impact on the amount of productivity needed for a
software project. Therefore, we designed four studies, using various
classification and regression methods, to examine the usefulness of that
relationship and its impact on UCP effort estimation. The results we obtained
are encouraging and show potential improvement in effort estimation.
Furthermore, the efficiency of that relationship is better over a dataset that
comes from industry because of the quality of data collection. Our comment on
the findings is that it is better to exclude environmental factors from
calculating UCP and make them available only for computing productivity. The
study also encourages project managers to understand how to better assess the
environmental factors as they do have a significant impact on productivity.
Comment: Journal of Software: Evolution and Process, 201
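The role of productivity in UCP-based estimation can be made concrete with a small sketch. It assumes the classic Use Case Points formulas (technical factor 0.6 + 0.01 x TFactor, environmental factor 1.4 - 0.03 x EFactor) and the commonly cited fixed ratio of 20 person-hours per UCP; the function names and the optional `efactor` switch are hypothetical and only illustrate the abstract's suggestion of leaving environmental factors out of the size calculation.

```python
def technical_complexity_factor(tfactor):
    """Classic UCP technical adjustment: 0.6 + 0.01 * TFactor."""
    return 0.6 + 0.01 * tfactor


def environmental_complexity_factor(efactor):
    """Classic UCP environmental adjustment: 1.4 - 0.03 * EFactor."""
    return 1.4 - 0.03 * efactor


def use_case_points(uucp, tfactor, efactor=None):
    """Adjusted size in UCP; efactor=None omits the environmental adjustment."""
    size = uucp * technical_complexity_factor(tfactor)
    if efactor is not None:
        size *= environmental_complexity_factor(efactor)
    return size


def effort_hours(ucp, productivity=20.0):
    """Effort = size * productivity (person-hours per UCP).

    The default of 20 is the traditional fixed ratio; a dataset-specific
    value would replace it when productivity is estimated from data.
    """
    return ucp * productivity
```

With `efactor=None`, the environmental factors no longer shrink or inflate the size measure and would instead inform the choice of `productivity`.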
Model tree based adaption strategy for software effort estimation by analogy
Background: Adaptation is a crucial task in analogy-based
estimation. Current adaptation techniques often use linear size or linear
similarity adjustment mechanisms, which are often not suitable for datasets that
have a complex structure with many categorical attributes. Furthermore, the use
of nonlinear adaptation techniques such as neural networks and genetic algorithms
requires considerable user interaction and parameter optimization to configure them
(such as network model, number of neurons, activation functions, training
functions, mutation, selection, crossover, etc.). Aims: In response to the
abovementioned challenges, the present paper proposes a new adaptation strategy
using Model Tree based attribute distance to adjust estimation by analogy and
derive new estimates. Using a Model Tree has the advantage of dealing with
categorical attributes, minimizing user interaction and improving the efficiency of
model learning through classification. Method: Seven well-known datasets have
been used with 3-Fold cross validation to empirically validate the proposed
approach. The proposed method has been investigated using K = 1 to 3
analogies. Results: Experimental results showed that the proposed approach
produced better results than those obtained by
analogy-based estimation with linear size adaptation, linear similarity adaptation,
'regression towards the mean' and null adaptation. Conclusions: Model Tree
could form a useful extension for estimation by analogy, especially for complex
datasets with a large number of categorical attributes.
A Comparative Study for Predicting Heart Diseases Using Data Mining Classification Methods
Improving the precision of heart disease detection has been investigated by
many researchers in the literature. This interest is driven by
overwhelming health care expenditures and erroneous diagnoses. As a result,
various methodologies have been proposed to analyze the disease factors, aiming
to decrease variation in physicians' practice and reduce medical costs and
errors. In this paper, our main motivation is to develop an effective
intelligent medical decision support system based on data mining techniques. In
this context, five data mining classifying algorithms, with large datasets,
have been utilized to assess and analyze the risk factors statistically related
to heart diseases in order to compare the performance of the implemented
classifiers (e.g., Naïve Bayes, Decision Tree, Discriminant, Random Forest,
and Support Vector Machine). To underscore the practical viability of our
approach, the selected classifiers have been implemented in MATLAB with
two datasets. Results of the conducted experiments showed that all
classification algorithms are predictive and can give relatively correct
answers. However, the decision tree outperforms the other classifiers with an
accuracy rate of 99.0%, followed by Random Forest. This is because both
of them rely on a similar mechanism, but Random Forest builds an ensemble
of decision trees. Although ensemble learning has been shown to produce
superior results, in our case the decision tree outperformed its
ensemble version.
An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation
Objective: This paper investigates the potential of ensemble learning for
variants of adjustment methods used in analogy-based effort estimation. The
number k of analogies to be used is also investigated. Method: We perform a
large scale comparison study where many ensembles constructed from n out of 40
possible valid variants of adjustment methods are applied to eight datasets.
The performance of each method was evaluated based on standardized accuracy and
effect size. Results: The results have been subjected to statistical
significance testing and show reasonably significant improvements in
predictive performance where ensemble methods are applied. Conclusion: Our
conclusions suggest that ensembles of adjustment methods can work well and
achieve good performance, even though they are not always superior to single
methods. We also recommend constructing ensembles from only linear adjustment
methods, as they have shown better performance and were frequently ranked
higher.
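The evaluation measure named in the abstract, standardized accuracy, can be sketched as below. The sketch assumes the common definition SA = 1 - MAR / MAR_P0, where MAR is the mean absolute residual of the method and MAR_P0 is the mean absolute residual of random guessing, computed here exactly over all ordered pairs of distinct projects rather than by simulation. Combining ensemble members by a simple average is an assumption of this sketch; the abstract does not state how the ensembles are combined.

```python
from statistics import mean


def mean_absolute_residual(actual, predicted):
    """MAR: average absolute difference between actual and predicted effort."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def random_guess_mar(actual):
    """MAR of guessing another project's actual effort, averaged over all pairs."""
    n = len(actual)
    return mean(
        abs(actual[i] - actual[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )


def standardized_accuracy(actual, predicted):
    """SA = 1 - MAR / MAR_P0; 0 means no better than random guessing."""
    return 1.0 - mean_absolute_residual(actual, predicted) / random_guess_mar(actual)


def ensemble_prediction(member_predictions):
    """Average the per-project predictions of several adjustment methods."""
    return [mean(values) for values in zip(*member_predictions)]
```

A value close to 1 indicates predictions far better than guessing, while values at or below 0 indicate a method that adds nothing over a random pick from the dataset.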
Learning best K analogies from data distribution for case-based software effort estimation
Case-Based Reasoning (CBR) has been widely used to generate good software
effort estimates. The predictive performance of CBR is dataset dependent and
subject to an extremely large space of configuration possibilities. Regardless of
the type of adaptation technique, deciding on the optimal number of similar
cases to be used before applying CBR is a key challenge. In this paper we
propose a new technique based on the Bisecting k-medoids clustering algorithm to
better understand the structure of a dataset and discover the optimal
cases for each individual project by excluding irrelevant cases. Results
obtained showed that understanding the data characteristics prior to the prediction
stage can help in automatically finding the best number of cases for each test
project. Performance figures of the proposed estimation method are better than
those of other regular K-based CBR methods.
Comment: arXiv admin note: substantial text overlap with arXiv: 1703.0456
Fuzzy Model Tree For Early Effort Estimation
Use Case Points (UCP) is a well-known method to estimate the project size,
based on the Use Case diagram, at early phases of software development. Although
the Use Case diagram is widely accepted as a de facto model for analyzing
object-oriented software requirements around the world, the UCP method has not received
sufficient attention because, as yet, there is no consensus on how to
produce software effort from UCP. This paper aims to study the potential of
using Fuzzy Model Tree to derive effort estimates based on UCP size measure
using a dataset collected for that purpose. The proposed approach has been
validated against Treeboost model, Multiple Linear Regression and classical
effort estimation based on the UCP model. The obtained results are promising
and show better performance than those obtained by classical UCP and Multiple
Linear Regression, and slightly better than those obtained by the Treeboost model.
v-SVR Polynomial Kernel for Predicting the Defect Density in New Software Projects
An important product measure to determine the effectiveness of software
processes is the defect density (DD). In this study, we propose the application
of support vector regression (SVR) to predict the DD of new software projects
obtained from the International Software Benchmarking Standards Group (ISBSG)
Release 2018 data set. Two types of SVR (e-SVR and v-SVR) were applied to train
and test these projects. Each SVR used four types of kernels. The prediction
accuracy of each SVR was compared to that of a statistical regression (i.e., a
simple linear regression, SLR). A statistical significance test showed that the
prediction accuracy of v-SVR with a polynomial kernel was better than that of SLR
when new software projects were developed on mainframes and coded in
third-generation programming languages.
Comment: 6 pages, accepted at Special Session: ML for Predictive Models in
Eng. Applications at the 17th IEEE International Conference on Machine
Learning and Applications, 17th IEEE ICMLA 201
A Comparison Between Decision Trees and Decision Tree Forest Models for Software Development Effort Estimation
Accurate software effort estimation has been a challenge for many software
practitioners and project managers. Underestimation leads to disruption in the
project's estimated cost and delivery. On the other hand, overestimation causes
outbidding and financial losses in business. Many software estimation models
exist; however, none have been proven to be the best in all situations. In this
paper, a decision tree forest (DTF) model is compared to a traditional decision
tree (DT) model, as well as a multiple linear regression model (MLR). The
evaluation was conducted using ISBSG and Desharnais industrial datasets.
Results show that the DTF model is competitive and can be used as an
alternative in software effort prediction.
Comment: 3rd International Conference on Communications and Information
Technology (ICCIT), Beirut, Lebanon, pp. 220-224, 201